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Abstract 

A phylogenetic birth-and-death model is a probabilistic graphical 
model for a so-called phylogenetic profile, i.e., the size distribution for 
a homolog gene family at the terminal nodes of a phylogeny. Profile 
datasets are used in bioinformatics analyses for the inference of evolu- 
tionary trees, and of functional associations between gene families, as 
well as for the quantification of various processes guiding genome evo- 
lution. Here we describe the mathematical formalism for phylogenetic 
birth-and-death models. We also present an algorithm for computing 
the likelihood in a gain-loss-duplication model. 

For background information on phylogenetic birth-and-death models, see |CM06j 
(preprint available under http : // arxiv . org/ abs/ q-bio/ 0509037vl ) . Here 
we give a self-containg comprehensive review, concentrating on the mathe- 
matical results. We describe our new algorithm for a very general class of 
gain-loss-duplication models. 



1 Introduction 

A phylogenetic birth-and-death mo(ie/ formalizes the evolution of an organism- 
specific census variable along a phylogeny. The phylogeny is a rooted tree, 
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i.e., a connected acyclic graph in which the edges are directed away from 
a special node designated as the tree root; the terminal nodes, or leaves, 
are bijectively labeled by the organisms. The model specifies edge lengths, 
as well as birth-and-death processes |Ros96l IKen49] acting on the edges. 
Let £(r) denote the set of edges, and let V(T) denote the node set of the 
tree. Populations of identical individuals evolve along the tree from the root 
towards the leaves by Galton- Watson processes. At non-leaf nodes of the tree, 
populations are instantaneously copied to evolve independently along the 
adjoining descendant edges. Let the random variable ^(x) G N = {0, 1, 2, . . . } 
denote the population count at every node x G V(T). Every edge xy G £(T') 
is characterized by a loss rate fi^y, a duplication rate X^^y and a gain rate K^y. 
If [X(t) : if: > O) is a linear birth-and-death process [Ken49irTak62] with these 
rate parameters, then 

P|e(2/) = m I = n} = P|x(t,y) = m | X(0) = n}, 

where txy > is the edge length, which defines the time interval during which 
the birth-and-death process runs. The joint distribution of (^(x) : x G V(T)) 
is determined by the phylogeny, the edge lengths and rates, along with the 
distribution at the root p, denoted as 7(n) = P{^(p) = n}. Specifically, for 
all set of node census values (n^: x G V(T)), 

pjVxG V(r): =n,} =7(n,) J] w,y[ny\n,] (1) 

xy&e-{T) 

where W2,.j^[m|n] = p|x(tx'j/) = m X{0) = n| denotes the transition proba- 
bility on the edge xy for the Markov process operating there. 

It is assumed that one can observe the population counts at the termi- 
nal nodes (i.e., leaves), but not at the inner nodes of the phylogeny. Since 
individuals are considered identical, we are also ignorant of the ancestral 
relationships between individuals within and across populations. The popu- 
lation counts at the leaves form a phylogenetic profile. Our central problem 
is to compute the likelihood of a profile, i.e., the probability of the observed 
counts for fixed model parameters. 

The transient distribution of linear birth-and-death processes is well- 
characterized |KM58l IKen49t ITak62] , as shown in Table [TJ Table [1] precisely 
states the distribution of xenolog and inparalog group sizes. 

The distribution of population counts can be obtained analytically from 
the constituent distributions of Table [H as shown by the following lemma. 
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Case Condition Transient distribution 



Group 



GLD 


K>0,A>0 p. 


X{t) 


= n 


X{0) 


= 


■ = NegativeBinomial(n; ^, 


xenolog 


GL 


K>0,A = P' 


'x{t) 


= n 


X{0) 


= 


■ = Poisson (n;r) 


xenolog 


DL 


/t = 0,A>0 P' 


'x{t) 


= n 


X{0) 


= 1 


■ = ShiftedGeometric(n;p, 


inparalog 


PL 


K = 0,A = p. 


'x{t) 


= n 


X{0) 


= 1 


■ = Bernoulli(n; 1 —p) 


inparalog 



Pctrameters: 



1 - e-^* 



r = K- 



P = 



jjL — jjLe 



H - Ae-('^-^)* 
Xt 

^ = ^ = TTAi 



and q 



A - Ae-(^-^)* 
IJ, - Ae-(''-^)* 



if A 7^ /u, 
if A = /X. 



Distributions: 

NegativeBinomial(n; 9, q) 

ShiftedGeometric(n;p, q') = < 



(1-^)^ ifn = 

p if n = 0; 

{l-p){l-q) ifn = l 
(1 ifn>l. 



Poisson(n;r) = e — 
n! 



Bernoulli(n; 1 — p) = ShiftedGeometric(n;p, 0) = 



P 



if n = 



I — p if n = 1. 



Table 1: Transient behavior of linear birth- and-death processes with loss 
rate > 0, gain rate n and duplication rate A: gain-loss-duplication (GLD), 
gain- loss (GL), duphcation-loss (DL) and pure-loss (PL) models. The last 
column of the table shows the relevant group for computing transition proba- 
bilities in a phylogenetic birth-and-dcath model. For the meaning of xenolog 
and inparalog groups, see the main text. 
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Lemma 1. Let (Cj : « = 1, 2, . . . ) be independent random variables that have 
identical, shifted geometric distributions with parameters p and q. Let r) 
be a discrete nonnegative random variable that is independent from with 
probability mass function ¥{r] = m} = H{m). Define w[m\n] = F{ri + 

J2i=i Ci — ''^l /^'^ m,n > 0, and w*[m\n] = F{r] + J2i=i d — ''^S ^0 > 0} 
for all m > n > . These values can be expressed recursively as follows. 

w[m\0] = H{m) {m > 0} (2a) 

w[0\n] = p ■ w[0\n - 1] {n > 0} (2b) 

w[l\n] = p ■ w[l\n - 1] + {1 - p){l - q) ■ w[0\n - 1] {n > 0} (2c) 

w[m\n] = q ■ w[m — l\n] {n>0,m>l} (2d) 
+ {1 — p — q) ■ w[m — l|n — 1] 
+ p ■ w[m\n — 1] 

Furthermore, 

w*[m\0] = H{m) {m > 0} (3a) 

w*[n\n] = {l-p){l-q)-w*[n-l\n-l] {n > 0} (3b) 

?i;*[m|n] = q ■ w*[m — l\n\ {m > n > 0} (3c) 

+ {1 — p){l — q) ■ w*[m — l\n — 1] 

For every edge xy, Equation ([2]) provides the transition probabilities Wa;y[m|n] 
w[m|n] in ([1]), when p, q and H{m) are taken from Table [1] for the process 
operating on the edge xy. Equation ([2]) is used below in our formulas. 



2 Surviving lineages 

A key factor in inferring the likelihood formulas is the probability that a 
given individual at a tree node x has no descendants at the leaves within 
the subtree rooted at x. The corresponding extinction probability is denoted 
by Dx. An individual at node x is referred to as surviving if it has at least 
one progeny at the leaves descending from x. Let S(x) denote the number of 
surviving individuals at each node x. The distribution of S(x) can be related 
to that of ^{x) by 

P{S(x) =m} = f2h^') Dlil - D,rF{^ix) =m + t}. (4) 
i=o ^ ^ ^ 
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The next two lemmas characterize the number of surviving xenologs and 
inparalogs: they follow the same class of distributions as the total number 
of xenologs and inparalogs. 

Lemma 2. For every edge xy G £(T), let Gy{n) denote the probability that 
there are n surviving members within an inparalog group at y. Then Gy{n) = 
ShiftedGeometric(ra; p', g') with 

, p[l-Dy) + {l-q)Dy ^ , q{l-Dy) 



p' = ^ and q' 



qDy 1 - qDy 

Lemma 3. For every edge xy G £(T), let Hy{n) denote the probability that 
there are n xenologs at y that survive. If Xxy = 0, then Hy{n) = Poisson(n; r') 
where r' = r(l — Dy). If Xxy > 0, then IIy{n) = NegativeBinomial(n; 6*, g'). 

In the formulas to follow, we use the probabilities i(;*[m|n], which apply 
Lemma [1] to surviving populations on edge xy: w*[m\n\ = w*[m\n], where 
the latter is defined by Equation ([3]) with settings p *— p' , q *— g', H{rn) <— 
ifj; . (m) from Lemmas [2] and [31 

Lemma [2] provides the means to compute extinction probabilities in a 
postorder traversal of the phylogeny. 

Lemma 4. If x is a leaf, then = 0. Otherwise, let x be the parent 
of xi, X2, . . . , Xc- Then can be written as 

c 

Dx = l[Gx,iO). (5) 



3 Conditional likelihoods 

Let ^{T) C V(T) denote the set of leaf nodes. A phylogenetic profile $ is 
a function ^(T) {0, 1,2,...}, which are the population counts observed 
at the leaves. Define the notation = ($(x): x G for the partial 

profile within a subset £' C XL(T). Similarly, let = (^(x): x G £') 

denote the vector- valued random variable composed of individual population 
counts. The likelihood of $ is the probability 

L = p{e(i:(T)) =$}. (6) 
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Let Tj. denote the subtree of T rooted at node x. Define the survival count 
range for every node x G V{T) as = Yliy<^L{Tx) ^iv)- The survival 
count ranges are calculated in a postorder traversal, since 

^ ^ f$(a;) if X is a leaf 

lEyGchiidren(x)^3/ Otherwise. 

We compute the likelihood using conditional survival likelihoods defined 
as the probability of observing the partial profile within given the number 
of surviving individuals E{x): 

LM = P{?(/:(r,)) = <^{^{T,)) I E{x) = n}. 

For m > Mj;, L^[m] = 0. For values m = 0,1,..., M^., the conditional 
survival likelihoods can be computed recursively as shown in Theorem O 
below. 



Theorem 5. If node x is a leaf, then 

Lx[n] = 



ifn^ 

1 ifn = <l>(x) 



If X is an inner node with children xi,...,Xc, then Lx[n] can be expressed 
using Lx^[-] and auxiliary values Ai-. and Bi-.^. for i = 1, . . . ,c in the following 
manner. Let u!*^\m\s] denote the transition probability in LemmalJl applied 
to surviving individuals at Xi, using the distributions H^X') from Lemma\^ 
and GxX') fi"om Lemma Let M[j] = Yli=i for all j = 1, . . . , c and 
M[0] = 0. Define also D[j] = Y[i=i^xi{0) o,nd D[0] = 1. Auxiliary values 
Bi-t,s are defined for all i = 1, . . . , c, t = 0, . . . , M[i — 1] and s = 0, . . . , M^. 
as follows. 

B^.,o,s = Yl <.M^]LxAM {0<s< (8a) 

m=0 

Br,t,M.^ = G,MBr,t-l,M., {0 < t < M[l - 1]} (8b) 

Bi-t,s = Bi.t-l,s+l + GxX^)Br,t-l,s I 0<7<M[j-l] } (S*^) 
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For alii = 1, . . . , c and n = 0, . . . , M[i], define Ai-n 

Ai;„ = (l-D[l])-"5i;o,„; (9a) 
A-n = {1 - D[7]y Q'^^om\a\{s;n,D[t-l])Ai.,,tB,.t,s, (9b) 



0<t<M[i~l] 
0<s<M^. 
t+s=n 



where i > 1 in fpjbj) . 

For all n = 0, . . . , M^, Ljji] = A^-^. 



The complete likelihood is computed as 
L=Y,Lp[m]F{E{p) = m} 

= E f E ^(^ + ^) t ') ^^^^ " • 

m=0 \j=0 ^ ^ / 

For some parametric distributions 7, the infinite sum in (fTOj) can be re- 
placed by a closed formula for P{S(p) = m}. Theorem below considers the 
stationary distributions for gain-loss-duplication and gain-loss models. 

Theorem 6. For negative binomial or Poisson population distribution at the 
root, the likelihood can be expressed as shown below. 

1. If'y{n) = Poisson (ri; r), then 

P{S(p) = m} = Poisson(m; r') (Ha) 

with r' = r(l — Dp). 

2. If'j{n) = NegativeBinomial(n; ^, g), then 

¥{E{p) =m} = NegativeBinomial(m; 9, q') (lib) 
with g' = 41^. 

3. If ^{p) has a Bernoulli distribution, i.e., 7(0) = 1 — 7(1) = 1 — p, 
then 

L = L,[0] + p{l - Dp) {Lp\l] - L,[0]) (11c) 
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Consequently, the likelihood for a Poisson distribution at the root is com- 



where T is the mean family size at the root. 

4 Algorithm 

The algorithm we describe computes the likelihood of a phylogenetic profile 
for a given set of model parameters. Algorithm ComputeConditionals 
below proceeds by postorder (depth-first) traversals; the necessary variables 
are calculated from the leaves towards the root. The loop of Line [T] computes 
the transition probabilities extinction probabilities D. and survival 

count ranges M. The loop of Line [9] carries out the computations suggested 
by Theorem [5l 

Theorem 7. Let T be a phylogeny with n nodes where every node has at 
most c* children. Let h denote the tree height, i.e., the maximum number 
of edges from the root to a leaf The ComputeConditionals algorithm 
computes the conditional survival likelihoods for a phylogenetic profile ^ onT 
in 0(^M^h+c*{Mh+n)^ time, where M = Mp = $(x) is the total number 
of homologs. 

If c* is constant, then the running time bound of Theorem [7] is 0{M'^h+n). 
For almost all phylogenies in a Yule-Harding random model, h = 0(logr2,), 
so the typical running time is 0(M^ log n). For all phylogenies, h < n — 1, 
which yields a 0{M'^n) worst-case bound. 



puted as 




(12) 



m=0 
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ComputeConditionals 
Input: phylogenetic profile $ 



1 for each node x G V(T) in a postorder traversal do 

2 Compute the sum of gene counts by ([7|) . 

3 Compute using ([5]). 

4 if a; is not the root 

5 then let y be the parent of x. 

6 for n = 0, . . . , do 

7 for m = 0, . . . , do 

8 compute w*^[TO|rt] using ([3|) with Hx{-) and G2;(-) from Lemmas [3] and [21 

9 for each node x e V(T) in a postorder traversal do 

10 if a; is a leaf 

11 then for aU n <— 0, . . . , <J>(a;) do set Lx[n] *— {n = <I>(x)} 

12 else 

13 Let xi, . . . ,Xc be the children of x 

14 Initiahze M[0] ^ and D[0] ^ 1 

15 for i ^ 1, . . . , c do 

16 set M[i] ^ M[i-1]+ Mx, and D[i] ^ D[i - 1] ■ D^, 

17 for aU t ^ 0, . . . , M[i - 1] and s ^ 0, ... , M^^ do 

18 compute Bi-t.s by Eqs. ^ 

19 if i = 1 then for all n 0, . . . , M\i] do set Ai-n ^ (1 - D[l])-"'Bi.o^n 

20 else 

21 for all n <— 0, . . . , M[i] do initialize Ai-n ^ 

22 for i ^ 0, . . . , M[i ~ 1] and s ^ 0, . . . , M^, do 

23 set Ai-n ^ A;n + Binomial(s; n, D[i - l])A,_i.tB^.t,s 

24 for aU n ^ 0, . . . , M[i\ do A,,n ^ (l - -D[i])~"^»;„ 

25 for all n <— 0, . . . , do set Lx[n] <^ Ac-.n- 



5 Mathematical proofs 

Proof of Lemma [21 Equations ([2a]) and f[3aj) are immediate since 

tL'[m|0] = ty*[m|0] = P{r; = m} = H{m). 
By the independence of Q, for all n > 0, 

n 

w[0\n] = F{ri + = 0} = w[0\n - l]P{Cn = 0} = w[0\n - 1] ■ 
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as in (|2bD. 

Let G{n) = ShiftedGeometric(n; p, g) be the common probability mass 
function of Q. For m,n > 0, 



n-l 



w[m\n] = P|?7 + ^0 = m| = ^P{Cn = k} ■ p|r/ + ^0 = m - 

i=l k=0 1=1 

m 

^^Gik) -wlm- k\n-l]. (13) 



k=0 



For m = 1, (fT^ is tantamount to (|2c|) . since G(0) = p and G'(l) = (1 — 
p)(l — q). For m > 1, f|T3l) can be further rewritten using G{k) = qG{k — 1) 
for all k > 1: 

w[m\n\ = G{0) ■ w[m\n — 1] + G'(l) • w[m — l\n — 1] 

m 

+ qG{k — 1) ■ w[m — k\n — 1] 

k=2 

= p ■ w[m\n — 1] + G'(l) ■ w[m — l|r2 — 1] 

m—l 

+ q G{k) ■ w[m — 1 — k\n — 1] 

k=l 

= p ■ w[m\n — 1] + G'(l) ■ w[m — l|r2 — 1] 
+ q(^w[m — l\n\ — G{0) ■ w[m — l|n — 1]^ , 

which leads to the recursion of pd|) since G{1) — qG{0) = 1 — p — q. 
Equation (l3bl) follows from 

n n 

w*[n\n] = P|r/+^G = VO > o} = F{r] = 0}-J]P{0 = 1} = //(O) (G'(l))". 

1=1 i=l 
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For m > n > 0, 



w*[m\n] = P|?7 + = m;Ci > o| 

1=1 

m—n+l 

= F{Cn = k}-w*[m-k\n-l] 

k=l 
m—n+l 

= J2 G{k)-w*[m-k\n-l] 



k=l 

m—n+l 



G{l)-w*[m-l\n-l]+ J2 QG{k-l) ■ w*[m — k\n — 1] 



k=2 



= G{1) ■ w*['m — l\n — 1] + q ■ w*[m\n — 1], 

as claimed in (l3cl) . □ 

Lemmas [2] and [3] rely on the following general result. 

Lemma 8. Let a & be a fixed parameter, and let {a„},^Q and {&n}5^o 
two number sequences related by the formula 



oo 



i=0 



br. = J2^^')^'i^~^r^n+^. (14a) 



(We use the convention 0° = 1 m the formula when a G {0, 1}.) Let A{z) = 
J2n'^riZ"' and B{z) = 'Ylm^nZ^ dcnotc the generating functions for the se- 
quences. Then 

B{z) = A{a + {I - a)z) (14b) 

Proof. If o" = 0, then 6„ = a„, and, thus fll4bl) holds. If a = 1, then 
6o = ^fclo^fc' ^^"^ = for > 0, which implies (I14bp . Otherwise, 



n=0 m=n ^ ' 
oo m ^ \ 

E<'™E-"("„>'"""(i-'^ 

m—n n—n V / 



m=0 n=0 

oo 

= + (1 - ^)^)" = ^(^ + (1 - ^)^) , 

m=0 

as claimed. □ 
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Corollary 9. Let {a„}J!^Q and {bn}'^=Q be two probability mass functions for 
non-negative integer random variables, related as in fll4ap . 

1. If an = ShiftedGeometric(n; p, g), then 6„ = ShiftedGeometric(n;p', g') 

with 

, p(i - (j) + (1 - g)(T , /-..x 

p = and q = . (15) 

1 — qa 1 — qa 

2. If an = NegativeBinomial(n; 6', g), then bn = Negativebinomial(n; 6^, g'), 
where q' is defined as in (ITSl) . 

3. If an = Poisson(?2; r), then bn = Poisson(n; r') with r' = r(l — a). 

Proof. The corollary follows for plugging into Lemma [8] the generating func- 
tions A{z) = ^^f_~l"^\ A{z) = and A{z) = e'^^-^^ for the shifted 
geometric, negative binomial, and Poisson distributions, respectively. □ 

Proof of Lemma [3 Let C be the random variable that is the size of an inpar- 
alog group at node y. In order to have n surviving inparalogs, there must 
be n + z inparalogs in total, out of which i do not survive, for some z > 0. 
Therefore, 

Gyin) = E P{C = + z} f + D;(1 - D., 



By Table [H P{C = n} = ShiftedGeometric( n; p, g) where the distribution pa- 
rameters p and g are determined by the parameters of the birth-and-death 
process on the edge leading to y (g = in the degenerate case where dupli- 
cation rate is 0). The lemma thus follows from Corollary [HI with a = Dy. □ 

Proof of Lemma O Let t] be the random variable that is the size of the 
xenolog group at y. In order to have n surviving xenologs, there must he n + i 
xenologs in total, out of which i do not survive, for some i > 0. Therefore, 

Hy{n) = f2Hv = n + ^}(J'^') D;(1 - Dy)-. (16) 

By Table [1] if A = 0, then r] has a Poisson distribution with parameter r; 
otherwise, rj has a negative binomial distribution with parameters 9 and g. 
In either case, the lemma follows from Corollary [9] after setting a = Dy. □ 
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Proof of Lemma [7| For leaves, the statement is trivial. When x is not a 
leaf, the lemma follows from the fact that survivals are independent between 
disjoint subtrees. □ 

Proof of Theorem O The formulas are obtained by tracking survival within 
the lineages xxi among the individuals at x. Define the indicator vari- 
ables Xjj for each individual j = 1, . . . , ^(x) and hneage i = 1, . . . , c, taking 
the value 1 if and only if individual j has at least one surviving offspring at Xi . 
In order to work with the survivals, we introduce some auxiliary random vari- 
ables that have the distributions of surviving xenologs and inparalogs. For 
every edge xXi, define the sequence of independent random variables (Cij : j = 
1,2,...) with identical distributions G^. (■), and the random variable T]i with 
distribution H^X')- Consequently, = O} = PjOj = O} = G^^O), 



and P|s(x,) = k ^{x) = n} = ¥^^{r]i + EJ=i Oi) = By Lemma El 
Gx^{k) = ShiftedGeometric(A;; p', g') with some p' and q'. 

Define the shorthand notation $j = |(^(T^.) = $(T^.)| for observing the 
counts in the subtree rooted at Xj. Let 

s 



In other words, if there are t + s individuals at x, then Bi-f ,, is the probability 
that s selected individuals survive in the lineage xXi and the given profile is 
observed in T^^ . By Lemma [H 



s 



m=0 



^(x) = s 



s 



m 



^iXi = m 



^ 7 = 1 



I -m]\/j <s: Cij ^ 



m\ ■ w^.\m\s\ 
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as shown in fl8aD. If t > 0, then 



s 
s 

s 

= G^XO) ■ P|$i; = s I = t + s - l} 

s+l 



G'(O) • Bi.t-l,s + -Bj;t-1. 



s+l, 



which is tantamount to fl8c|) . Equation fl8bp follows from the fact that Bi.^t,s = 
for s > . 

For every i, let denote the set of individuals at x that survive in at 
least one of the lineages Xi, . . . , Xi, i.e., = |j : {-^ij + ■ ■ ■ + 7^ 0}|. 

Given the exchangeabihty of individuals, p|$i; . . . ; $j = is the same 
for all sets of the same size \^\ = n. Let 

A;n = p{$i;... I m = ny 

In particular, l^d = and, thus, Ac-n = Lx[n\. 

Since Ai-n is the same for all values of ^{x) > n, and $1 is determined 
solely by survival in lineage xxi, 



n 



1 - D[l]rp{<I>i I l^il = n} = (1 - D[l]rA,,^, 



implying fl9aj) . 
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By a similar reasoning for i > 1, 



= p||y,| = n I = n} -Pj^i;... ; <l>i | |y,| = n; ^(x) = n} (17) 

= (1-D[Z])"A;„ 

Let Xj denote the set of individuals that survive in the lineage xxf. Xi = 
|j: Xij 7^ O}. We rewrite the left-hand side of (1171) by conditioning on the 
set of individuals that survive in lineage xxi but not in the lineag 

= J2 P{$oXAy.-i = s|e(2;) = r2} 

X P|$i; . . . ; = M \ § ^(x) = nj 

s+t=n 



s+t=n 



Combining this latter equality with ( |T7I) leads to ( I9b[) . 



□ 



Proof of Theoreml^ By (jl]), Lemma [8] with a = shows the relation- 
ship between the generating functions for the distributions of ^(x) and S(x). 
The theorem considers the case when x is the root and Corollary applies 
to 7(n) = ¥{i{x) = n}. □ 

Proof of Theorem By Theorem [5], the algorithm correctly computes the 
conditional survival likelihoods. Let T be the phylogeny with root p and n 
nodes. In order to prove the running time result, consider first the loop 
of Line [1] Lines [2H5] take 0(1) time for each x G V{T). Line [8] is exe- 
cuted (Mj -|- 1)^ times for every non-root x. If x is an inner node with 
children xi,X2, - ■ ■ ,Xc, then = X]^=i ^^j- Consequently, 



5^(M,^. + l)2<(M,. + l)2 + (c-l) 



(18) 
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Now, consider the tree levels Vq, V^, where Vq = {p}, and for all i = 1, . . . , h, 
Vj consists of all the children of nodes in 'Vj„i. In other words, Vj is the set 
of nodes that are reached through i edges from the root. By f|T8l) . 

J2iMy + If < |V,| - |V,_i| + (^^- + 
for alH > 0. Therefore, 

j2iMy+ir<{\v.\-i) + {M,+ir (19) 

So, 

(m.+i)^=x:e(^^^+i)' 

x£V(T)\{p} i=l xeV,; 

<n-l-h + h{Mp + 1)2 
= 0{n + hM'^). 

Therefore, executing Line [H] through all iterations takes 0(^M'^h + n) time. In 
order to bound the loop's running time in Line [HI consider the Bi-t^^ and Ai-n 
values that are needed for a given node x with children xi, . . . ,Xc. By fl8a|) . 
computing all -Bi;o,s values takes 0{{Mxi + 1)^) time. Every Bi.t,s with t > 
and v4j;„ is calculated in 0(1) time. Using the bound M[i] < M^, iteration i 
of the loop in Line [TBI takes 0((M^ + l){Mx. + 1)) time. By summing for i = 
1, . . . , c, we get that for node x with c^, children, the loop of Line M takes 
0((M, + 1)(M, + c,)) time (since J2iiM,. + 1) = + c). Now, (M, + 
1)(M, + c,) = (M, + 1)(M, + 1 + c - 1), and 

/i 

5^ (M, + l)(c, _ 1) = ^ ^(M, + i)(c, - 1) 

h 

< (max c,-l)Y,Yl + ^) 

i=o xeVi 

< (c*-l)(Mp(/i + l)+n). 

By our previous discussions, 

Y iMx + l){M^ + c^) <n-l-h + h{Mp + if + (c* - l){Mp{h + 1) + n) 

xeV(T) 

= 0{M'^h + c*{Mh + n)). 
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So, the loop of Line [H] takes 0{M'^h + Mhc*) time, which leads to the The- 
orem's claim when combined with the bound on the loop's running time in 
Line [H □ 



6 Likelihood correction for absent profiles 

Suppose that profiles are restricted to the condition that {$(a;) > O} must 
hold for at least one terminal node x. The corresponding likelihoods 

Li = pjvx G L{T) : ^{x) = ^{x) ^{x) > for at least one leafj 

are obtained from the full likelihood by employing a correction that involves 
the probability of the condition |Fel92j . Namely, 

r T 



p|^(x) > for at least one leaf| 1 — p|.^(x) = for all leaves x| 

The probability that C,{x) =0 at all the leaves is the likelihood of the all- 
profile $0 = (0,...,0). By Theorem [5l L^O] = UxyeE{T) HyiO) for the 
profile $0- Combined with (fT2!) . we have the correction formula Li = 
with 



Po = 

for 7(n) = Poisson(n; r). 



n ^.(0)) ■exp(r(l-D,)), (20) 



xyeE(T) 



7 Inferring family sizes at ancestors and count- 
ing lineage-specific events 

Given a profile $, the posterior probabilities for gene family size at node x are 
computed by using the conditional survival likelihoods Lrc[n\ and likelihoods 
of some relevant profiles on truncated phylogenies. In order to compute the 
gene content at node x, for example, consider the profile ^x-.m for all m 
that applies to a phylogeny obtained by pruning the edges below x, that is, 
^x:m{y) = ^{y) for y ^ Tx and ^x:m{x) = m. Let Lx-.m denote the likelihood 
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of ^x-.m on the pruned tree. Then 

P|e(x) = m ^(£(7)) = $1 = ^■"^"=0^"^^ ^-^^ ^ — ^ 

gives the posterior probabihty that the family had m homologs at node x. 
The number of famihes present at node x, denoted by A^^^, is inferred as a 
posterior mean value by summing posterior probabilities: 



Nx 



n 

J]p{e(x)>o|e(i:(r)) = <i>.} 

i=l 

+ ^vh{x)>0 e(^(T))=<fo|, (21) 

where $j : i = 1, . . . , ?t, are the profiles in the data set and po is the likelihood 
of the all-0 profile $o from fl20l) . Notice that the formula includes the absent 
all-0 profiles; there are such profiles by expectation. 

For each edge xy^ the posterior probabilities for gain^ loss, expansion and 
contraction are: 

P{gain(xy)} = P{e(x) = 0,e(z/) > 0} = P{e(x) = O} - P{e(x) = 0,e(?/) 

p{ioss(xy)} = p{e(x) > o,ay) = 0} = F{ay) = 0} - F{^{x) = o,ay) 

P { expansion (xy)} = P{^(x) = l,^{y) > l} = ^{^{x) = l} 

-p{^(a;) = i,ay) = 0} -P{e(x) = i,ay) = 1} 

P{contraction(x?/)} = F{^{x) > l,^{y) = l} = P{^(l/) = l} 

- ¥{ax) = 0, ay) = 1} - na^) = i, = i}, 

where all probabilities are conditioned on the observation of the phylogenetic 
profile {^(£(T)) = $}. Expected numbers for gains, losses, expansions and 
contractions on each edge xy are computed by formulas analogous to fl2Tl) . 

Posterior probabilities of the general form p|^(x) = n^^y) = m 

.^(XL(T)) = $|, characterizing lineage-specific family size changes on edge xy, 
can also be computed by using survival likelihoods on truncated phylogenies. 
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Figure 1: Decomposition of the phylogeny in order to compute posterior 
probability for lineage-specific events. Equation fl22l) shows the factors cor- 
responding to the three parts. 



In particular, we decompose the events as 



P|^(x) = n,^{y) = m,^{L{T)) = $} = I x II x III 

= p{^(x)=n,C(/:(T)\L(T.)) =$(£(T)\£(T.))} (22) 
xP{e(?/) = m,e(£(T,)) = ^mTy)) I = n} 
xp|e(L(T,.) \ £(T,)) = $(£(TJ \ L{Ty)) | ^(x) = n}, 

where the second factor can be written as 

= m,CmTy)) = $(£(T,)) I = n] 



k=0 



Figure [T] illustrates the decomposition of the phylogeny into three parts, 
corresponding to the three factors in (!22l) . 
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